EDA Code for Spatial Exploration

Import all libraries used.

Run this if Plotly plots don't render.

Set plot style for matplotlib.

Load all processed weather stations.

Let's begin our spatial exploration. For each feature, let's compute the standard deviation across the 10 weather stations at each time step. A larger standard deviation indicates greater spatial variability: the stations are less correlated with one another, so each one contributes more new information.
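
As a minimal sketch of this computation, assume each feature is held in a DataFrame with one row per time step and one column per station. The `temperature` frame and the `station_i` column names below are synthetic and hypothetical, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 time steps (rows) x 10 stations (columns).
rng = np.random.default_rng(0)
n_steps = 48
base = 10 + 5 * np.sin(np.linspace(0, 4 * np.pi, n_steps))  # shared regional signal
temperature = pd.DataFrame(
    {f"station_{i}": base + rng.normal(0, 0.5, n_steps) for i in range(10)}
)

# Standard deviation across stations (axis=1), one value per time step.
spatial_std = temperature.std(axis=1)

print(spatial_std.describe())
```

Plotting `spatial_std` against time for each feature gives one curve per feature, which makes periods of high spatial variability easy to spot.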

The standard deviation is not particularly high, considering how variable weather features can be. Let's continue our exploration by plotting a station-by-station correlation matrix for each feature, where a value closer to 1 indicates a stronger positive linear correlation.

Note that some pairs of weather stations have a correlation near 1; we can treat these as collinear. Let's write a function that works out, for each feature, which stations should be removed. A threshold of >= 0.99 seems reasonable for this.
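
One way such a function could look is sketched below. It assumes a single feature stored as a DataFrame with one column per station, and it uses a simple greedy rule (keep the first station of each collinear pair, drop the second). The function name, the greedy rule, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def find_redundant_stations(df, threshold=0.99):
    """Return station columns to drop so that no remaining pair has r >= threshold."""
    corr = df.corr()  # Pearson correlation between station columns
    cols = list(corr.columns)
    to_drop = []
    for i, kept in enumerate(cols):
        if kept in to_drop:
            continue  # already marked redundant, do not use it as a reference
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[kept, other] >= threshold:
                to_drop.append(other)
    return to_drop

# Hypothetical data: station_b nearly duplicates station_a, station_c is independent.
rng = np.random.default_rng(1)
signal = rng.normal(size=500)
feature = pd.DataFrame({
    "station_a": signal,
    "station_b": signal + rng.normal(0, 0.01, 500),
    "station_c": rng.normal(size=500),
})

redundant = find_redundant_stations(feature, threshold=0.99)
print(redundant)
```

Running this once per feature yields a feature-to-stations mapping of columns to exclude.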

These station-feature combinations will not be considered in model training, since they are redundant and add little information.

Next, let's see whether there are any natural clusters in the features, in particular the precipitation amount. Before that, because clustering algorithms suffer from the curse of dimensionality (the Euclidean distance metric loses meaning in high dimensions), PCA will be used to reduce dimensionality.
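
To make the reduction step concrete, here is a library-free sketch of PCA using NumPy's SVD. It assumes each row is a time step and each column a station, and it keeps 3 components so the result can be plotted in 3D. The `pca_project` helper and the synthetic data are hypothetical, and the original analysis may well have used a library implementation instead.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained_ratio[:n_components]

# Hypothetical data: 300 time steps x 10 stations driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(0, 0.05, size=(300, 10))

projected, ratio = pca_project(X, n_components=3)
print(projected.shape)
print(ratio.sum())  # fraction of variance retained by the 3 components
```

If the features are on different scales, standardizing each column before the projection is usually advisable.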

From these 3D visualizations, only PRECIP_AMOUNT and STATION_PRESSURE appear to have distinct clusters. Let's attempt to cluster both; the cluster labels could possibly be used as a feature. First, let's plot inertia for PRECIP_AMOUNT to determine the optimal number of clusters. It seems that 4 clusters is reasonable.
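
The inertia (elbow) check can be sketched as follows, assuming scikit-learn's `KMeans` is the clustering implementation. The data here is synthetic (`make_blobs` with 4 centers in 3 dimensions) and only stands in for the PCA-reduced PRECIP_AMOUNT values; the range of `k` is likewise an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the 3D PCA projection of PRECIP_AMOUNT.
X, _ = make_blobs(n_samples=400, centers=4, n_features=3,
                  cluster_std=0.6, random_state=0)

# Inertia = sum of squared distances from each point to its assigned centroid.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of `k` after which inertia stops dropping sharply; plotting `inertias` against `k` makes it visible.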

Let's visualize the cluster results. First, let's visualize the precipitation clusters.

Let's create an animated 3D visualization of the precipitation clusters.

Let's view the STATION_PRESSURE clusters in 3D.

K-means clustering performs poorly for pressure. Let's fit and visualize a Gaussian mixture model instead.
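
A minimal sketch of fitting a Gaussian mixture, assuming scikit-learn's `GaussianMixture`. The two elongated synthetic groups and the choice of 2 components are hypothetical; they illustrate one kind of shape (non-spherical clusters) that a mixture with full covariance matrices can model, and are not a description of the actual pressure data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two elongated groups, stretched along the first axis.
rng = np.random.default_rng(0)
cov = [[4.0, 0.0], [0.0, 0.05]]
group_a = rng.multivariate_normal([0.0, 0.0], cov, size=200)
group_b = rng.multivariate_normal([0.0, 1.5], cov, size=200)
X = np.vstack([group_a, group_b])

# Full covariance lets each component take its own elliptical shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

labels = gmm.predict(X)        # hard cluster assignment per sample
probs = gmm.predict_proba(X)   # soft membership probabilities per sample
print(np.bincount(labels))
```

Unlike K-means, the mixture also returns per-sample membership probabilities, which could be used as features in place of hard labels.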

Much better!

Let's see whether the precipitation clusters have any meaning in terms of the radar images. Let's load the radar images as a memory-mapped array to save memory.
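
Memory-mapped loading might look like the sketch below, assuming the radar images are stored as a single `.npy` file. The file name, array shape, and dtype are hypothetical; the demo writes a small temporary file so the example is self-contained.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for the radar archive: 24 frames of 64 x 64 pixels.
path = os.path.join(tempfile.mkdtemp(), "radar_demo.npy")
frames = np.random.default_rng(0).random((24, 64, 64), dtype=np.float32)
np.save(path, frames)

# mmap_mode="r" maps the file read-only; data is paged in only when accessed.
radar = np.load(path, mmap_mode="r")

# Reductions can read the mapped data without first copying the whole file.
frame_max = radar.max(axis=(1, 2))
print(type(radar).__name__, radar.shape, frame_max.shape)
```

Slicing `radar[i]` touches only the bytes for that frame, which keeps memory use low when iterating over a large archive.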

And let's plot our results.

Not much meaning temporally! Now, let's explore the temporal distribution of the STATION_PRESSURE clusters we found earlier...

That's very strange. Let's also look at the average pressure across stations.

This shows that the spatial distribution of the clusters is basically the same. Perhaps STATION_PRESSURE is too highly correlated across stations to exhibit any meaningful clusters, after all.

Let's investigate the correlation between the precipitation from the weather stations and from the weather radar. As the weather radar is quite coarse in resolution, let's compare maximum precipitation.

Let's get the correlation coefficient first.
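
A sketch of this comparison, assuming the Pearson correlation coefficient is the one intended and that both sources are aligned on the same time steps. The array shapes and the randomly generated values are hypothetical placeholders, so the printed coefficient says nothing about the real data.

```python
import numpy as np

# Hypothetical, independently generated data on a shared time axis.
rng = np.random.default_rng(0)
n_steps = 100
station_precip = rng.gamma(shape=0.5, scale=2.0, size=(n_steps, 10))     # time x station
radar_frames = rng.gamma(shape=0.5, scale=2.0, size=(n_steps, 32, 32))   # time x H x W

# Maximum precipitation per time step from each source.
station_max = station_precip.max(axis=1)
radar_max = radar_frames.reshape(n_steps, -1).max(axis=1)

# Pearson correlation between the two maximum series.
r = np.corrcoef(station_max, radar_max)[0, 1]
print(round(float(r), 3))
```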

This correlation coefficient is quite low, showing that the radar data isn't closely correlated with the weather station precipitation data. This could mean one of two things: (1) the radar data is introducing additional information, or (2) the radar data contains errors that make it differ from the weather station precipitation data.

Let's plot the trend.

Radar data usually shows a higher maximum than weather station data. This makes sense: the weather radar covers a larger geographical area and is thus more likely to pick up a higher value. This also suggests the weather radar data is useful, as it provides new information in addition to the weather station data.